Intelligent Systems Group, University of Bristol
Additional material for the Ukwabelana Zulu corpus

Authors

  • Sebastian Spiegler
  • Andrew van der Spuy
  • Peter A. Flach
Abstract

In this document we describe the scheme used for labelling the open-source Ukwabelana Zulu corpus, as well as the rules employed by the part-of-speech (POS) tagger that assigns POS tags to morphologically analysed words. A detailed description of Zulu morphology, the corpus itself and its generation is given in [2]. All resources can be downloaded from http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/Resources/.

1 The labelling scheme of the Ukwabelana Zulu corpus

In this section we describe the labelling scheme of the Ukwabelana corpus by listing all labels used, each with a short description, its frequency in the corpus and an example.

| Description | Freq. | Example |
| --- | --- | --- |
| adverb | 351 | abenjalo |
| adverb prefix | 38 | ikakhulu |
| adjective root | 382 | aluhlaza |
| aspect | 2914 | akagangile |
| aspect + verb root | 2 | weza |
| augmentative | 3 | yindilingakazi |
| conjunction | 21 | kokuba |
| demonstrative | 147 | angalesosikhathi |
| demonstrative agreement class 1 | 27 | bakulo |
| demonstrative agreement class 2 | 17 | balaba |
| demonstrative agreement class 3 | 17 | bakulowaya |
| demonstrative agreement class 4 | 9 | lena |
| demonstrative agreement class 5 | 17 | ekulelibutho |
| demonstrative agreement class 6 | 13 | kulawomadwala |
| demonstrative agreement class 7 | 29 | angalesosikhathi |
| demonstrative agreement class 9 | 54 | bale |
| demonstrative agreement class 10 | 17 | kuyilezo |
| demonstrative agreement class 11 | 8 | salolu |
| demonstrative agreement class 14 | 5 | lobo |
| demonstrative agreement class 15 | 26 | ilokho |
| derivational morpheme | 70 | abazala |
| diminutive | 106 | amadodana |
| feminine | 50 | amadodakazi |
| futurative | 571 | asizosatshiswa |
| negative subject prefix class 1 | 60 | akazi |
| negative subject prefix first person plural | 14 | asazike |
| negative subject prefix first person singular | 59 | angazanga |
| negative subject prefix class 2 | 27 | abazi |
| negative subject prefix second person plural | 2 | anisenayo |
| negative subject prefix second person singular | 23 | akwazi |
| negative subject prefix class 3 | 2 | awusalindi |
| negative subject prefix class 4 | 6 | ayikhulumi |
| negative subject prefix class 5 | 7 | aliboni |
| negative subject prefix class 6 | 9 | awazi |
| negative subject prefix class 7 | 7 | asazi |
| negative subject prefix class 9 | 26 | ayibalulekile |
| negative subject prefix class 10 | 9 | azifani |
| negative subject prefix class 11 | 2 | alushi |
| negative subject prefix class 15 | 56 | akukhethi |
| hortative | 109 | akenze |
| human | 83 | abadlali |
| indicative subject prefix class 1 | 964 | izinyawowafika |
| indicative subject prefix first person plural | 246 | asashiya |
| indicative subject prefix first person singular | 496 | angilandiwe |
| indicative subject prefix class 2 | 497 | abangathanda |
| indicative subject prefix second person plural | 38 | enanimbiza |
| indicative subject prefix second person singular | 520 | owabe |
| indicative subject prefix class 3 | 294 | awazange |
| indicative subject prefix class 4 | 255 | eyahlukene |
| indicative subject prefix class 5 | 164 | elalingene |
| indicative subject prefix class 6 | 369 | abemgwema |
| indicative subject prefix class 6 + verb root | 2 | aziwa |
| indicative subject prefix class 7 | 154 | asethuse |
| indicative subject prefix class 9 | 343 | ayabuzisana |
| indicative subject prefix class 10 | 209 | azifisayo |
| indicative subject prefix class 11 | 49 | eyolubeka |
| indicative subject prefix class 14 | 29 | baba |
| indicative subject prefix class 15 | 563 | akukhumbule |
| imperative | 67 | bamba |
| inanimate | 246 | anesisindo |
| interrogative | 80 | athini |
| interjection | 24 | awu |
| initial vowel of the noun | 1785 | anesisindo |
| initial vowel + noun class 1 | 192 | benosikhuni |
| initial vowel + noun class 2 | 25 | babengobani |
| initial vowel + noun class 3 | 48 | esinobhanana |
| initial vowel + noun class 5 | 245 | ayishumi |
| initial vowel + noun class 9 | 40 | eyinyanga |
| initial vowel + noun class 11 | 96 | enolaka |
| initial vowel + noun class 14 | 7 | nobovu |
| the suffix ke | 9 | asazike |
| locative prefix | 540 | asezisefweni |
| locative suffix | 247 | asezisefweni |
| modal/auxiliary root | 160 | awuzange |
| noun class 1 | 226 | akasemubi |
| noun class 2 | 107 | abadlali |
| noun class 3 | 228 | amanhlakomuzi |
| noun class 4 | 85 | ayiminingi |
| noun class 5 | 23 | balishum |
| noun class 6 | 297 | ayesemaningi |
| noun class 7 | 293 | anesisindo |
| noun class 9 | 445 | ayenentukuthelo |
| noun class 10 | 324 | anezimali |
| noun class 11 | 26 | angilutho |
| noun class 14 | 63 | bobugazagaza |
| noun class 15 | 551 | awukuzwe |
| negative | 689 | angakaya |
| noun root | 2590 | anezimali |
| object prefix class 1 | 470 | angimgijimele |
| object prefix first person plural | 59 | abesibeke |
| object prefix first person singular | 170 | akungiphe |
| object prefix class 2 | 105 | ababona |
| object prefix second person plural | 10 | kuniphathe |
| object prefix second person singular | 94 | asakulindile |
| object prefix class 3 | 86 | awunyakazisa |
| object prefix class 4 | 156 | ayenze |
| object prefix class 5 | 74 | alibhekisisa |
| object prefix class 6 | 75 | awathole |
| object prefix class 7 | 47 | asibekela |
| object prefix class 9 | 236 | asiyiyeke |
| object prefix class 10 | 201 | asizethwele |
| object prefix class 11 | 19 | aluphethe |
| object prefix class 14 | 15 | abuveze |
| object prefix class 15 | 226 | abekucele |
| optative | 5 | abobuya |
| preposition | 1405 | anesisindo |
| participial subject prefix class 1 | 517 | ebe |
| participial subject prefix class 1 + verb root | 130 | asebesebenza |
| participial subject prefix first person plural | 89 | besimlindele |
| participial subject prefix first person singular | 227 | bengazi |
| participial subject prefix class 2 | 178 | ababejayiva |
| participial subject prefix second person plural | 18 | enanimbiza |
| participial subject prefix second person singular | 194 | osubhaliwe |
| participial subject prefix class 3 | 95 | sewuqale |
| participial subject prefix class 4 | 83 | ibe |
| participial subject prefix class 5 | 91 | belikade |
| participial subject prefix class 6 | 170 | abemgwema |
| participial subject prefix class 7 | 48 | besesenzeka |
| participial subject prefix class 9 | 142 | beyikhuluma |
| participial subject prefix class 10 | 113 | azifisayo |
| participial subject prefix class 11 | 21 | beluqondeni |
| participial subject prefix class 14 | 9 | babukhona |
| participial subject prefix class 15 | 307 | bekulo |
| past tense morpheme | 169 | abemgwema |
| (imperative) plural | 10 | dlanini |
| position 2 | 78 | angalesosikhathi |
| position 3 | 22 | bakulowaya |
| potential | 14 | angavuma |
| pronoun class 1 | 39 | akanaye |
| pronoun first person plural | 34 | abafowethu |
| pronoun first person singular | 32 | ami |
| pronoun class 2 | 34 | ayekubo |
| pronoun second person plural | 7 | kini |
| pronoun second person singular | 26 | akukwakho |
| pronoun class 3 | 16 | ayesenawo |
| pronoun class 4 | 4 | layo |
| pronoun class 5 | 23 | ayeyilo |
| pronoun class 6 | 6 | awodwa |
| pronoun class 7 | 10 | iwaso |
| pronoun class 9 | 38 | ayekuyo |
| pronoun class 10 | 18 | akuzona |
| pronoun class 11 | 2 | lona |
| pronoun class 14 | 7 | bona |
| pronoun class 15 | 13 | akuyikho |
| presentative | 16 | nasebabanango |
| presentative agreement class 1 | 4 | nanguya |
| presentative agreement class 2 | 1 | nampa |
| presentative agreement class 5 | 2 | nanto |
| presentative agreement class 6 | 1 | nanka |
| presentative agreement class 7 | 1 | nasi |
| presentative agreement class 9 | 1 | nansi |
| presentative agreement class 10 | 3 | nazo |
| presentative agreement class 15 | 3 | nakho |
| quantifier root | 31 | awodwa |
| relative morpheme | 1205 | amahlanu |
| reduplication | 31 | amangelengele |
| reflexive object prefix | 78 | asiziyeke |
| relative suffix | 146 | abakufisayo |
| subjunctive subject prefix class 1 | 167 | akenze |
| subjunctive subject prefix first person plural | 55 | asibonge |
| subjunctive subject prefix first person singular | 80 | angibange |
| subjunctive subject prefix class 2 | 86 | abafunde |
| subjunctive subject prefix second person plural | 14 | nibe |
| subjunctive subject prefix second person singular | 80 | awubange |
| subjunctive subject prefix class 3 | 25 | awugcobe |
| subjunctive subject prefix class 4 | 19 | ibe |
| subjunctive subject prefix class 5 | 15 | aliphakamise |
| subjunctive subject prefix class 6 | 96 | awasale |
| subjunctive subject prefix class 6 + verb root | 1 | andise |
| subjunctive subject prefix class 7 | 23 | asihambe |
| subjunctive subject prefix class 9 | 34 | ayibheke |
| subjunctive subject prefix class 10 | 24 | azibulale |
| subjunctive subject prefix class 11 | 2 | lube |
| subjunctive subject prefix class 14 | 2 | buthule |
| subjunctive subject prefix class 15 | 60 | akube |
| stabilizer | 114 | akabona |
| verb suffix | 4768 | akezwa |
| negative verb suffix | 317 | ayengasaphumi |
| vocative prefix | 1 | webantu |
| negative perfect verb suffix | 42 | ayengakhishwanga |
| perfect verb suffix long form | 436 | ahlakaniphile |
| perfect verb suffix short form | 1763 | abesibeke |
| verb root | 8544 | akenze |
| subjunctive verb suffix | 642 | akenze |
| word | 1 | kwashayansumoni |
| applied extension | 626 | akwenzele |
| causative extension | 519 | aliphakamise |
| intensive extension | 19 | acabangisise |
| neuter extension | 345 | abonakale |
| passive extension | 766 | asizosatshiswa |
| reciprocal extension | 163 | angamelana |
| possessive morpheme | 2 | mnganami |
| possessive agreement class 1 | 118 | abafowethu |
| possessive agreement class 2 | 46 | abantababantu |
| possessive agreement class 3 | 42 | ka |
| possessive agreement class 4 | 61 | ka |
| possessive agreement class 5 | 112 | elakwambabo |
| possessive agreement class 6 | 27 | awebhubesi |
| possessive agreement class 6 + initial vowel | 19 | amafutha |
| possessive agreement class 6 + initial vowel + noun class 1 | 1 | onkosazana |
| possessive agreement class 6 + initial vowel + noun class 3 | 1 | ogwayi |
| possessive agreement class 7 | 68 | ingakwesikababa |
| possessive agreement class 9 | 104 | eyezitha |
| possessive agreement class 10 | 67 | ezemisebenzi |
| possessive agreement class 11 | 15 | lukamamncube |
| possessive agreement class 14 | 16 | bakulowaya |
| possessive agreement class 15 | 160 | basemvakwami |

2 Noun classes in Zulu

There are twelve noun classes in Zulu, numbered 1–7, 9, 10, 11, 14 and 15. Typically, the classes are identified by distinctive noun prefixes (the second prefix in the words below). However, some classes have members which lack the noun prefix; in the case of nouns of classes 5 and 11, the prefix is lacking completely. Class 2 has an alternate form where the initial vowel and the noun prefix have become fused. A lacking prefix is indicated by ∅ in the examples below.
| noun class | example | noun class | example |
| --- | --- | --- | --- |
| 1 | u-mu-ntu ‘person’ | 2 | a-ba-ntu ‘people’ |
| | u-∅-baba ‘father’ | | o-baba ‘fathers’ |
| 3 | u-mu-zi ‘village’ | 4 | i-mi-zi ‘villages’ |
| | u-∅-nogwaja ‘hare’ | 2 | o-nogwaja ‘hares’ |
| 5 | i-∅-gama ‘name’ | 6 | a-ma-gama ‘names’ |
| 7 | i-si-tsha ‘dish’ | 10 | i-zi-tsha ‘dishes’ |
| 9 | i-m-pala ‘impala’ | 10 | i-zim-pala ‘impalas’ |
| | i-∅-khwaya ‘choir’ | 6 | a-ma-khwaya ‘choirs’ |
| 11 | u-∅-phondo ‘horn’ | 10 | i-zim-pondo ‘horns’ |
| 14 | u-bu-hle ‘beauty’ | | |
| | u-∅-tshani ‘grass’ | | |
| 15 | u-ku-dla ‘food’ | | |

The numbering system was devised by [1] and reflects the historical affinities between Zulu and other Bantu languages: Zulu lacks classes 8, 12 and 13, which are found in other Bantu languages. In the labels used in the database, morphemes that command or show agreement have been labelled as xn, where x is a letter or sequence of letters and n is a number: thus the morpheme m in mfundi is labelled , as it marks the noun as belonging to noun class 1. The morpheme si in engisifundisile is marked , as it shows object agreement with a noun of class 7.

3 Part-of-speech tagging based on the morphological structure of a word

In this section we list a set of 34 rules which were provided by a linguistic expert and used to assign the part-of-speech (POS) tag to a word whose morphological structure is known. Zulu words can often be POS-tagged by their first morpheme alone. Sometimes, however, the correct tag can only be determined from a combination of the first morpheme (morpheme1) and one other morpheme, which may be referred to as the identifier (abbreviated J). There may be several J morphemes in a word, but it is always the leftmost one which identifies the word (jMorpheme1). The J morpheme may be separated from the first morpheme by several other morphemes. J morphemes are [, , , , , , , , , , , , , , , , ]. Furthermore, we introduce the variable X, which is a placeholder for the following values: [1–15, 1s, 2s, 1p, 2p].

Remark: Elements in a list correspond to a disjunction (logical OR). For instance, a rule if morpheme1 = and jMorpheme1 = [, ] then label x, means if morpheme1 is and jMorpheme1 is or then the label is x. Rules are in no particular order; however, the identification by first morpheme and first J morpheme has to be performed before the identification by first morpheme only.

3.1 Identification by first morpheme and first J morpheme

1. if morpheme1 = and jMorpheme1 =
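The two-stage rule scheme of Section 3 can be illustrated with a minimal Python sketch. It is not the authors' tagger. The labels P1, P2, J_A, J_B and the tags POS_A, POS_B are hypothetical placeholders, not labels from the corpus. The sketch also assumes that the J morpheme is searched only among the morphemes after the first one, and that a word matching no rule is left untagged (None).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence

# Hypothetical identifier (J) labels; the corpus defines its own set.
J_LABELS: FrozenSet[str] = frozenset({"J_A", "J_B"})


@dataclass(frozen=True)
class Rule:
    first: FrozenSet[str]        # allowed labels of morpheme1 (logical OR)
    j: Optional[FrozenSet[str]]  # allowed labels of jMorpheme1; None = first-morpheme-only rule
    pos: str                     # POS tag assigned when the rule matches


def leftmost_j(labels: Sequence[str]) -> Optional[str]:
    """Return the leftmost J label after the first morpheme, if any (assumption)."""
    for label in labels[1:]:
        if label in J_LABELS:
            return label
    return None


def tag(labels: Sequence[str], rules: Sequence[Rule]) -> Optional[str]:
    """Assign a POS tag from the morpheme labels of one analysed word."""
    if not labels:
        return None
    first, j = labels[0], leftmost_j(labels)
    # Pass 1: rules that test the first morpheme and the first J morpheme.
    if j is not None:
        for rule in rules:
            if rule.j is not None and first in rule.first and j in rule.j:
                return rule.pos
    # Pass 2: rules that test the first morpheme only.
    for rule in rules:
        if rule.j is None and first in rule.first:
            return rule.pos
    return None


# Hypothetical rules; their order in the list does not matter.
RULES: List[Rule] = [
    Rule(frozenset({"P1"}), None, "POS_A"),
    Rule(frozenset({"P1", "P2"}), frozenset({"J_A", "J_B"}), "POS_B"),
]

print(tag(["P1", "X", "J_B", "J_A"], RULES))  # POS_B: the leftmost J morpheme decides
print(tag(["P1", "X"], RULES))                # POS_A: no J morpheme, first morpheme only
```

Running the two passes separately means the order of rules within the list is irrelevant, while the combined rules still take precedence over the first-morpheme-only rules, as the text requires.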




Publication date: 2010